Popularity of content in social media is unequally distributed, with some items receiving a disproportionate
share of attention from users. Predicting which newly submitted items will become popular is critically
important for both hosts of social media content and its consumers. Accurate and timely prediction would enable
hosts to maximize revenue through differential pricing for access to content or ad placement. Prediction would 
also give consumers an important tool for filtering the ever-growing amount of content. Predicting popularity 
of content in social media, however, is challenging due to the complex interactions between content quality 
and how the social media site chooses to highlight content. Moreover, most social media sites also selectively 
present content that has been highly rated by similar users, whose similarity is indicated implicitly by their 
behavior or explicitly by links in a social network. While these factors make it difficult to predict popularity 
a priori, we show that stochastic models of user behavior on these sites allow predicting popularity based on
early user reactions to new content. By incorporating the various mechanisms through which web sites display 
content, such models improve on predictions based on simply extrapolating from the early votes. Using data 
from one such site, the news aggregator Digg, we show how a stochastic model of user behavior distinguishes 
the effect of the increased visibility due to the network from how interested users are in the content. We find a 
wide range of interest, distinguishing stories primarily of interest to users in the network ("niche interests") from
those of more general interest to the user community. This distinction is useful for predicting a story's eventual 
popularity from users' early reactions to the story. 



I. INTRODUCTION 

Success or popularity in social media is not evenly dis- 
tributed. Instead, a small number of users dominate the activ- 
ity on the site and receive most of the attention of other users. 
The popularity of contributed items likewise shows extreme 
diversity. For example, relatively few of the four billion
images on the social photo-sharing site Flickr are viewed thousands
of times, while most of the rest are rarely viewed. Of
the tens of thousands of new stories submitted daily to the so- 
cial news portal Digg, only a handful go on to become wildly 
popular, gathering thousands of votes, while most of the re- 
maining stories never receive more than a single vote from the 
submitter herself. Among thousands of new blog posts ev- 
ery day, only a handful become widely read and commented 
upon. Given the volume of new content, it is critically im- 
portant to provide users with tools to help them sift through 
the vast stream of new content to identify interesting items in 
a timely manner, or at least those items that will prove to be
successful or popular. Accurate and timely prediction will 
also enable social media companies that host user-generated 
content to maximize revenue through differential pricing for 
access to content or ad placement, and encourage greater user 
loyalty by helping their users quickly find interesting new con- 
tent. 

Success in social media is difficult to predict. Although 
early and late popularity, which can be measured in terms of 
user interest, e.g., votes or views, an item generates from its 
inception, are somewhat correlated (13]|3U, we know little 
about what drives success. Does success derive mainly from 
an item's inherent quality Q, users' response to it [10], or
some external factors, such as social influence [23, 25]? In a
landmark study, Salganik et al. [34] addressed this question
experimentally by measuring the impact of content quality 
and social influence on the eventual popularity or success of 
cultural artifacts. They showed that while quality contributes 
only weakly to their eventual success, social influence, or 
knowing about the choices of other people, is responsible for 
both the inequality and unpredictability of success. In their 
experiment, Salganik et al. asked users to rate songs they lis- 
tened to. The users were assigned to different groups. In the 
control group (independent condition), users were simply pre- 
sented with lists of songs. In the other group (social influence 
condition), users were also shown how many times each song 
was downloaded by other users. The social influence condi- 
tion resulted in large inequality in popularity of songs, mea- 
sured by the number of times the songs were downloaded. Al- 
though a song's quality, as measured by its popularity in the 
control group, was positively related to its eventual popular- 
ity in the social condition group, the variance in popularity at 
a given quality was very high. This means that two songs of 
similar quality could end up with vastly different levels of suc- 
cess. Moreover, when users were aware of the choices made 
by others, popularity was also unpredictable, meaning that on 
repeating the experiment, the same song could end up with a 
very different level of popularity. 

Although Salganik et al.'s study was limited to a small set 
of songs created by unknown bands, its conclusions about in- 
equality and unpredictability of popularity appear to apply to 
cultural artifacts in general and social media production in 
particular. While this would appear to prohibit prediction of 
popularity, we argue that understanding how the collective be- 
havior of Web users emerges from the decisions made by in- 
terconnected individuals allows us to predict the popularity of 
items from the users' early reaction to them. As in previous
work [15, 16, 22], we use a stochastic modeling framework
to mathematically describe the social dynamics of Web users. 
This framework represents each user and each submitted item
as a stochastic process with a few states, e.g., a simple Markov
process whose future state depends only on its present state
and the input it receives. We used this approach to study collective
user activity on the social news aggregator Digg. We
produced a model that partially explains, and predicts [27],
the social voting patterns on Digg and related these aggregate
behaviors to the ways Digg enables users to discover new
content. While this model included social influence, i.e., the 
increased visibility of stories to a user's neighbors in the social 
network, it did not address the commonality of users' interests 
indicated by links. This phenomenon, known as homophily, is 
a key aspect of social networks. In this paper we describe a 
new extension to the model that accounts for systematic varia- 
tions of interests within and outside of the network. We make 
further changes to the model to more closely match it to web 
site behavior. First, the new model's state transition rates account
for the daily variation in user activity [35], thereby focusing
on variations of votes on individual stories compared
to the average activity rate on the site. Second, we account for 
the variation in number of votes a story receives before it is 
promoted, which the prior model ignored. 

By separating the impact of story quality and social influ- 
ence on the popularity of stories on Digg, a stochastic model 
of social dynamics enables two novel applications: (1) esti- 
mating inherent story quality from the evolution of its ob- 
served popularity, and (2) predicting its eventual popularity 
based on the early reaction of users to the story. Specifically, 
to predict how popular a story will become, we can use the 
early votes, even those cast before the story is promoted, to 
estimate how interesting it is to voters. With this estimate, 
the model then determines, on average, the story's subsequent 
evolution. We study these claims empirically on a sample of 
stories retrieved from Digg. We show that by adjusting for 
the differing interests among voters, the new model improves 
predictions of popularity from early reactions of users. 

The paper is organized as follows. In Section II we describe
details of the social news aggregator Digg, which provides an
empirical foundation and a data set for investigating the utility
of stochastic models on the prediction task. Section III
presents an overview of the stochastic modeling framework.
In Section IV we apply the framework to study dynamics of
social voting on Digg. We review an existing model of social
dynamics of Digg and show that it explains many of the
empirically observed features of aggregate behavior of voters
on that site. In Section V we extend this model to include
variations in story interest to users. Then, in Section VI we
show how the model can predict eventual popularity of newly
submitted stories on Digg.



II. SOCIAL NEWS PORTAL DIGG 

With over 3 million registered users, the social news ag- 
gregator Digg is one of the more popular news portals on 
the Web. Digg allows users to submit and rate news stories 
by voting on, or 'digging', them. There are many new submissions
every minute, over 16,000 a day. Every day Digg
picks about a hundred stories that it believes will be most interesting
to the community and promotes them to the front
page. Although the exact promotion mechanism is kept secret 
and changes occasionally, it appears to take into account the 
number of votes the story receives and how rapidly it receives 
them. Digg's success is fueled in large part by the emergent 
front page, which is created by the collective decision of its 
many users. 

While the life cycle of each story may be drastically differ- 
ent from others, its basic elements are the same. These are 
specified by Digg's user interface, which defines how users 
can post or discover new stories and interact with other users. 
A model of social dynamics has to take these elements into 
account when describing the evolution of story popularity. 



A. User interface 

A newly submitted story goes on the upcoming stories list, 
where it remains for a period of time, typically 24 hours, or 
until it is promoted to the front page, whichever comes first. 
The default view shows newly submitted stories as a chrono- 
logically ordered list, with the most recently submitted story 
at the top of the list, 15 stories to a page. To see older stories, 
a user must navigate to page 2, 3, etc. of the upcoming sto- 
ries list. Promoted stories (Digg calls them 'popular') are also 
displayed as a chronologically ordered list on the front pages, 
15 stories to a page, with the most recently promoted story at 
the top of the list. To see older promoted stories, a user must
navigate to page 2, 3, etc. of the front page. Figure 1 shows
a screenshot of a Digg front page. Users vote for the stories 
they like by 'digging' them. The yellow badge to the left of 
each story shows its current popularity. 

Digg also allows users to designate friends and track their 
activities, i.e., see the stories friends recently submitted or 
voted for. The friends interface is available through the
"Friends' Activity" link at the top of any Digg web page (see,
for example, Fig. 1). The friend relationship is asymmetric.
When user A lists user B as a friend, A can watch the
activities of B but not vice versa. We call A the fan of B. A 
newly submitted story is visible in the upcoming stories list, as 
well as to submitter's fans through the friends interface. With 
each vote, a story becomes visible to the voter's fans through 
the friends interface, which shows the newly submitted stories 
that user's friends voted for. 

In addition to these interfaces, Digg also allows users to 
view the most popular stories from the previous day, week, 
month, or year. Digg also implements a social filtering fea- 
ture which recommends stories, including upcoming stories, 
that were liked by users with a similar voting history. This 
interface, however, was not available at the time the data for 
our study was collected and hence is not part of the stochastic 
models described in this paper. Thus we examine a period of 
time where Digg had a relatively simple user interface, which 
simplifies the stochastic models. This choice allows us to focus
on evaluating the stochastic model approach, particularly
the empirical question of whether averaging is useful in the
social media setting in spite of long-tail distributions, which
contrast with the narrow distributions found in most statistical
physics settings.

FIG. 1: Screenshot of the front page of the social news aggregator Digg.



B. Dynamics of popularity 

While a story is in the upcoming stories list, it accrues votes 
slowly. If the story is promoted to the front page, it accumulates
votes at a much faster pace. Figure 2(a) shows the evolution
of the number of votes for two stories submitted in June 2006. 
The point where the slope abruptly increases corresponds to 
promotion to the front page. The vast majority of stories are 
never promoted and, therefore, never experience the sharp rise 
in the number of votes that accompanies being featured on 
the front page. As the story ages, accumulation of new votes 
slows down [38], and after a few days the total number of
votes received by a story saturates to some value. This value, 
which we also call the final number of votes, gives a measure 
of the story's success or popularity. 

Popularity varies widely from story to story. Figure 2(b)
shows the distribution of the final number of votes received by
front page stories that were submitted over a period of about 
two days in June 2006. The distribution is characteristic of 
'inequality of popularity', since a handful of stories become 
very popular, accumulating thousands of votes, while most 
others can only muster a few hundred votes. This distribu- 
tion applies to front page stories only. Stories that are never 
promoted to the front page receive very few votes, in many 
cases just a single vote from the submitter. Such distributions 
are also called 'long tailed' distributions. This means that in 
systems displaying such distributions extreme events, e.g., a 
story receiving many thousands of votes, occur much more 
frequently than would be expected if the underlying processes 
were Poisson or Gaussian in nature. 

The long tail is a ubiquitous feature of human activity. 
It is present in inequality of popularity of cultural artifacts, 
such as books and music albums [34], and also manifests itself
in a variety of online behaviors, including tagging, where
a few documents are tagged much more frequently than others,
collaborative editing on wikis [21], and general social media
usage [37]. The same distribution of popularity was also
observed in a sample of more than 30,000 stories promoted to
Digg's front page over the course of a year [38].

FIG. 2: Dynamics of social voting. (a) Evolution of the number of votes received by two front page stories in June 2006. (b) Distribution of
popularity of 201 front page stories submitted in June 2006.

While unpredictability of popularity is more difficult to ver- 
ify than in the controlled experiments of Salganik et al., it is 
reasonable to assume that a similar set of stories submitted to 
Digg on another day will end with radically different numbers 
of votes. In other words, while the distribution of the final 
number of votes these stories receive will look similar to the 
distribution in Figure 2(b), the number of votes received by individual
stories will be very different in the two realizations.



C. Data collection 

We collected data for the study by scraping Digg's Web 
pages in May and June 2006. The May data set consists of 
stories that were submitted to Digg May 25-27, 2006. We 
followed these stories by periodically scraping Digg to de- 
termine the number of votes stories received as a function of 
the time since their submission. We collected at least 4 such 
observations for each of 2152 stories, submitted by 1212 dis- 
tinct users. Of these stories, 510, by 239 distinct users, were 
promoted to the front page. We followed the promoted sto- 
ries over a period of several days, recording the number of 
votes the stories received. This May data set also records the 
location of the stories on the upcoming and front pages as a 
function of time. 

The June data set consists of 201 stories promoted to the 
front page between June 27 and 30, 2006. For each story, we 
collected the names of its first 216 voters. 

We focus our data collection on the early stages of story 
evolution - from submission until shortly after promotion. 
The reason for this is that the Digg social network has a much 
larger effect on upcoming than front page stories due to the 
much more rapid addition of stories to the upcoming list. This 
large influx of stories makes it difficult for users to find a new 
story before it becomes hidden by the arrival of more stories. 
In this case, enhanced visibility via the network for fans of the
submitter or early voters is particularly important, and a model
of social dynamics has to account for it. In light of these observations,
and to speed up data collection, we focus on
the early votes for stories.

FIG. 3: Voting rate (diggs per hour) on front page stories at the end of
June 2006. The indicated dates are the start of each day (0:00 GMT).
The minimum in daily activity is around 9am GMT. Each point is the
average voting rate for 100 successive votes.

Activity on Digg varies considerably over the course of a 
day, as seen in Fig. 3. Adjusting times by the cumulative activity
on the site accounts for this variation and improves predictions
[35]. We define the "Digg time" between two events
(e.g., votes on a story) as the total number of votes on front 
page stories during the time between those events. In our data 
set, there are on average about 2500 such votes per hour, with 
a range of about a factor of 4 in this rate during the course 
of a day. This behavior is similar to that seen in an extensive 
study of front page activity in 2007 [35], and as in that study
we scale the measure by defining a "Digg hour" to be the av- 
erage number of front page votes in an hour, i.e., 2500 for our 
data set. We evaluate the consequence of this variability by 
contrasting a model based on real time (in Sec. IV) with one
based on Digg time (in Sec. V).
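
The rescaling to Digg time can be illustrated with a minimal sketch. It assumes a sorted list of timestamps (in seconds) of all front page votes is available; the function name, variable names and toy data are hypothetical and serve only to illustrate the definition above.

```python
from bisect import bisect_right

VOTES_PER_DIGG_HOUR = 2500  # average front page votes per hour in the data set


def digg_time(front_page_vote_times, t_start, t_end):
    """Digg time, in Digg hours, elapsed between two events.

    front_page_vote_times: sorted timestamps of all front page votes (assumed input).
    Counts the votes with t_start < t <= t_end and divides by the average hourly rate.
    """
    lo = bisect_right(front_page_vote_times, t_start)
    hi = bisect_right(front_page_vote_times, t_end)
    return (hi - lo) / VOTES_PER_DIGG_HOUR


# Toy data: 5000 evenly spaced front page votes within one real hour (a busy period).
busy_hour = [i * 3600 / 5000 for i in range(1, 5001)]
elapsed = digg_time(busy_hour, 0, 3600)  # one busy real hour spans two Digg hours
```

In this toy case a real hour with twice the average activity counts as two Digg hours, which is the sense in which Digg time compensates for daily variation in activity.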

In addition to voter activity, we also extracted a snapshot
of the social network of the top-ranked 1020 Digg users as 
of June 2006. This data contained the names of each user's 
friends and fans. Since the original network did not contain in- 
formation about all the voters in our data, we augmented it in 
February 2008 by extracting names of friends of about 15,000
additional users. Many of these users added friends between 
June 2006 and February 2008. Although Digg does not pro- 
vide the time a new link was created, it lists the links in reverse 
chronological order and gives the date the friend joined Digg. 
By eliminating friends who joined Digg after June 30, 2006, 
we were able to reconstruct the fan links for all voters in our 
data. This data allows us to identify, for each vote, whether 
the user was a fan of any prior voter on that story, in which 
case the story would have appeared in the friends interface for 
that user. 
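
The fan-vote test described above can be sketched as follows. This is an illustration only: the data layout (an ordered list of voters and a dictionary of friend lists) and all names are assumptions, not the format of the actual data set.

```python
def flag_fan_votes(voters, friends_of):
    """For each vote on one story, report whether the voter was a fan of a prior voter.

    voters: user names in the order they voted (submitter first).
    friends_of: dict mapping a user to the set of users they list as friends;
                a user is a fan of each of their friends.
    """
    prior_voters = set()
    flags = []
    for user in voters:
        # The story was visible through the friends interface if any of the
        # user's friends had already voted on (or submitted) it.
        flags.append(bool(friends_of.get(user, set()) & prior_voters))
        prior_voters.add(user)
    return flags


# Hypothetical network: bob is a fan of alice; dave is a fan of carol and erin.
friends_of = {"bob": {"alice"}, "dave": {"carol", "erin"}}
flags = flag_fan_votes(["alice", "bob", "carol", "dave"], friends_of)
```

Here the votes by bob and dave are flagged as fan votes, because each of them lists an earlier voter as a friend.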

Votes by fans account for 6% of the votes in the June data 
set and about 3% of the front page votes. 

The data sets used in this and previous works were collected 
before Digg's API was introduced. Scraping Web pages to 
extract data had several issues. First, data had to be manu- 
ally cleaned to ensure consistency. Second, since vote time 
stamps were not available on the Web page, we had to supple- 
ment June 2006 data by using the Digg API in October 2009 
to obtain the time of each vote, the final number of votes the 
story received, and the time of promotion. In the intervening 
time, however, some of the users had deleted their accounts. 
Since we could not easily resolve the time of the vote of an 
inactive user, we had to delete these users from the voters list. 
We believe that the small fraction of data lost in this manner 
(less than 8% of the data) does not adversely affect the mod- 
eling study. However, in the future we plan to repeat the study 
on a much cleaner data set obtained through the Digg API.



III. STOCHASTIC MODELS OF SOCIAL DYNAMICS 

Rather than account for the inherent variability of individ- 
uals, stochastic models focus on describing the macroscopic, 
or aggregate, behavior of the system, which can be described 
by average quantities. In the context of Digg, such quanti- 
ties include average rate at which users post new stories and 
vote on existing stories. Such macroscopic descriptions often 
have a simple form and are analytically tractable. Stochastic 
models do not reproduce the results of a single observation;
rather, they describe the 'typical' behavior. These models are
analogous to the approach used in statistical physics, demo- 
graphics and macroeconomics where the focus is on relations 
among aggregate quantities, such as volume and pressure of a 
gas, population of a country and immigration, or interest rates 
and employment. 

We represent each individual entity, whether a user or a 
story, as a stochastic process with a small number of states. 
This abstraction captures much of the individual complexity
and environmental variability by casting users' decisions as
inducing probabilistic transitions between states. While this
modeling framework applies to stochastic processes of varying
complexity, for simplicity we focus on simple processes
that obey the Markov property, namely, processes in which a
user's future state depends only on her present state and the
input she receives. A Markov process can be succinctly captured by a
state diagram showing the possible states of the user and con- 
ditions for transition between those states. 

We assume that all users have the same set of states, and 
that transitions between states depend only on the state and 
not the individual user. That is, the state captures the key rel- 
evant properties determining subsequent user actions. Then, 
the aggregate state of the system can be described simply by 
the number of individuals in each state at a given time. That 
is, the system configuration at this time is defined by the occupation
vector $\vec{n} = (n_1, n_2, \ldots)$, where $n_k$ is the number of
individuals in state $k$. For example, in the context of a given
story on Digg, one of the states for a user could be "has voted 
for the story". The component of the occupation vector corre- 
sponding to this state is the number of users who have voted 
for this story, without regard for which particular users those 
are. 
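
As a small illustration of the occupation vector, the sketch below counts users per state while discarding their identities. The state labels and user names are hypothetical.

```python
from collections import Counter

STATES = ("not_seen", "seen", "voted")  # hypothetical state labels for one story


def occupation_vector(user_states):
    """Return (n_1, n_2, ...): the number of users in each state, ignoring who they are."""
    counts = Counter(user_states.values())
    return tuple(counts.get(state, 0) for state in STATES)


user_states = {"alice": "voted", "bob": "seen", "carol": "voted", "dave": "not_seen"}
n = occupation_vector(user_states)
```

Swapping the states of any two users leaves the vector unchanged, which is the simplification the aggregate description relies on.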

The next step in developing the stochastic model is to sum- 
marize the variation within the collection of histories with a 
probabilistic description. That is, we characterize the possible
occupation vectors by the probability, $P(\vec{n}, t)$, that the system is
in configuration $\vec{n}$ at time $t$. The evolution of $P(\vec{n}, t)$, governed
by the Stochastic Master Equation [19], is almost always
too complex to be analytically tractable. Fortunately we
can simplify the problem by working with the average occupation
number, whose evolution is given by the Rate Equation

$$\frac{d\langle n_k \rangle}{dt} = \sum_j w_{jk}(\langle \vec{n} \rangle) \langle n_j \rangle - \langle n_k \rangle \sum_j w_{kj}(\langle \vec{n} \rangle) \qquad (1)$$

where $\langle n_k \rangle$ denotes the average number of users in state $k$ at
time $t$, i.e., $\sum_{\vec{n}} n_k P(\vec{n}, t)$, and $w_{jk}(\langle \vec{n} \rangle)$ is the transition rate
from state $j$ to state $k$ when the occupation
vector is $\langle \vec{n} \rangle$.

Using the average of the occupation vector in the transition 
rates is a common simplifying technique for stochastic mod- 
els. A sufficient condition for the accuracy of this approxima- 
tion is that variations around the average are relatively small. 
In many stochastic models of systems with large numbers of 
components, variations are indeed small due to many inde- 
pendent interactions among the components. More elaborate 
versions of the stochastic approach give improved approximations
when variations are not small, particularly due to correlated
interactions [32]. User behavior on the web, however,
often involves distributions with long tails, whose typical be- 
haviors differ significantly from the average 01371. In this 
case we have no guarantee that the averaged approximation 
is adequate. Instead we must test its accuracy for particular 
aggregate behaviors by comparing model predictions with ob- 
servations of actual behavior, as we report below. 
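
To make the averaged description concrete, the following sketch integrates the Rate Equation (Eq. 1) with a simple forward-Euler scheme. The three-state chain, its transition rates and the step size are assumptions chosen for illustration; they are not fitted to Digg data.

```python
def rate_equation_step(n, w, dt):
    """One forward-Euler step of d<n_k>/dt = sum_j w[j][k] <n_j> - <n_k> sum_j w[k][j]."""
    size = len(n)
    rates = w(n)  # transition-rate matrix evaluated at the current averages
    dn = [
        sum(rates[j][k] * n[j] for j in range(size))
        - n[k] * sum(rates[k][j] for j in range(size))
        for k in range(size)
    ]
    return [n[k] + dt * dn[k] for k in range(size)]


def integrate(n0, w, t_end, dt=0.001):
    """Integrate the average occupation numbers from time 0 to t_end."""
    n = list(n0)
    for _ in range(round(t_end / dt)):
        n = rate_equation_step(n, w, dt)
    return n


def constant_rates(n):
    # Hypothetical rates: state 0 -> 1 at rate 0.5, state 1 -> 2 at rate 0.2.
    # They ignore n here, but the interface allows rates that depend on it.
    return [[0.0, 0.5, 0.0],
            [0.0, 0.0, 0.2],
            [0.0, 0.0, 0.0]]


final = integrate([1000.0, 0.0, 0.0], constant_rates, t_end=10.0)
```

Because every outflow from one state appears as an inflow to another, the total number of users is conserved, which provides a quick sanity check on any set of rate equations written from a state diagram.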

In the Rate Equation, the occupation number $n_k$ increases due
to users' transitions from other states to state $k$, and decreases
due to transitions from state $k$ to other states. The equations
can be easily written down from the user state diagram.
Each state corresponds to a dynamic variable in the mathematical
model (the average number of users in that state),
and it is coupled to other variables via transitions between
states. Every transition must be accounted for by a term in the
equation, with transition rates specified by the details of the
interactions between users.

FIG. 4: State diagram of user behavior for a single story. A user
starts in the state at the left, may find the story through one of the
three interfaces and may then vote on it. At a given time, the story
is located on a particular page of either the upcoming or front page
lists, not both. This diagram shows votes for a story on either page
p of the front pages or page q of the upcoming pages. Only fans of
previous voters can see the story through the friends interface. Users
in the friends, front or upcoming states may choose to leave Digg,
thereby returning to the initial state (with those transitions not shown in
the figure). Users reaching the "vote" state remain there indefinitely
and cannot vote on the story again. Parameters next to the arrows
characterize state transitions.

In summary, the stochastic modeling framework requires 
specifying the aggregate states of interest for describing the 
system and how individual user behaviors create transitions 
among these states. The modeling approach is best suited to 
cases where the users' decisions are mainly determined by a 
few characteristics of the user and the information they have 
about the system. These system states and transitions give 
the rate equations. Solutions to these equations then give es- 
timates of how aggregate behavior varies in time and depends 
on the characteristics of the users involved. 



IV. A MODEL OF SOCIAL DYNAMICS OF DIGG

Underlying a stochastic model of social dynamics is a behavioral
model of an individual Web user. The behavioral
model takes into account the choices a Web site's user interface
allows users. Detailed data about human activity that can
be collected from social media sites such as Digg allow us
to parameterize the models and test them by comparing their
predictions to the observed collective dynamics.

An earlier study of social dynamics of Digg [15] used a simple
behavioral model that viewed each Digg user as a stochastic
Markov process, whose state diagram with respect to a single
story is shown in Fig. 4. According to this model, a user
visiting Digg can choose to browse the front pages to see the
recently promoted stories, upcoming stories pages for the recently
submitted stories, or use the friends interface to see the
stories her friends have recently submitted or voted for. She
can select a story to read from one of these pages and, if she
considers it interesting, vote for it. The user's environment,
the stories she is seeing, changes in time due to the actions of
all the users.

We characterize the changing state of a story by three values:
the number of votes, $N_{\mathrm{vote}}(t)$, the story has received by
time $t$ after it was submitted to Digg, the list the story is in at
time $t$ (upcoming or front page) and its location within that
list, which we denote by $q$ and $p$ for upcoming and front page
lists, respectively.

With Fig. 4 as a modeling blueprint, we relate the users'
choices to the changes in the state of a single story. In terms
of the general rate equation (Eq. 1), the occupation vector $\vec{n}$
describing the aggregate user behavior at a given time has the
following components: the number of users who see a story
via one of the front pages, one of the upcoming pages, through
the friends pages, and number of users who vote for a story,
$N_{\mathrm{vote}}$. Since we are interested in the number of users who
reach the vote state, we do not need a separate equation for
each state in Fig. 4: at a given time, a particular story has a
unique location on the upcoming or front page lists. Thus,
for simplicity, we can group the separate states for each list in
Fig. 4 and consider just the combined transition for a user to
reach the page containing the story at the time she visits Digg.
These combined transition rates depend on the location of the
story in the list, i.e., the value of $q$ or $p$ for the story. With this
grouping of user states, the rate equation for $N_{\mathrm{vote}}(t)$ is:

$$\frac{dN_{\mathrm{vote}}(t)}{dt} = r \left( \nu_f(t) + \nu_u(t) + \nu_{\mathrm{friends}}(t) \right) \qquad (2)$$

where $r$ measures how interesting the story is, i.e., the probability
a user seeing the story will vote on it, and $\nu_f$, $\nu_u$ and
$\nu_{\mathrm{friends}}$ are the rates at which users find the story via one of the
front or upcoming pages, and through the friends interface,
respectively.

In this model, the transition rates appearing in the rate equation
depend on the time $t$ but not on the occupation vector.
Nevertheless, the model could be generalized to include such
a dependence if, for example, a user currently viewing an interesting
story not only votes on it but explicitly encourages
people they know to view the story as well.
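
The rate equation for the number of votes (Eq. 2) can be integrated numerically once the visibility rates are specified. The sketch below is only an illustration: the piecewise-constant rates, the promotion time, the value of r and the initial vote by the submitter are hypothetical choices, not the visibility models or parameter estimates of this paper.

```python
def integrate_votes(r, nu_front, nu_upcoming, nu_friends, t_end, dt=0.01, n0=1.0):
    """Forward-Euler integration of dN_vote/dt = r * (nu_f(t) + nu_u(t) + nu_friends(t)).

    n0 is the assumed initial number of votes (the submitter's own vote).
    """
    n_vote = n0
    for i in range(round(t_end / dt)):
        t = i * dt
        n_vote += dt * r * (nu_front(t) + nu_upcoming(t) + nu_friends(t))
    return n_vote


PROMOTION_TIME = 5.0  # hypothetical promotion time, in hours


def nu_upcoming(t):
    # Users per hour who see the story on the upcoming pages (assumed constant).
    return 20.0 if t < PROMOTION_TIME else 0.0


def nu_front(t):
    # Users per hour who see the story on the front pages after promotion (assumed).
    return 0.0 if t < PROMOTION_TIME else 400.0


def nu_friends(t):
    # Users per hour who see the story through the friends interface (assumed).
    return 4.0


votes = integrate_votes(r=0.1, nu_front=nu_front, nu_upcoming=nu_upcoming,
                        nu_friends=nu_friends, t_end=10.0)
```

With these made-up rates the vote count grows slowly before promotion and roughly seventeen times faster afterwards, reproducing qualitatively the abrupt change of slope at promotion.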


A. Story Visibility 

Before we can solve Eq. 2, we must model the rates at
which users find the story through the various Digg interfaces. 
These rates depend on the story's location in the list. The pa- 
rameters of these models depend on user behaviors that are 
not readily measurable. Instead, we estimate them using data 
collected from Digg, as described below. 

a. Visibility by position in list A story's visibility on
the front page or upcoming stories lists decreases as recently
added stories push it further down the list. The stories are
shown in groups: the first page of each list displays the 15
most recent stories, page 2 the next 15 stories, and so on.

We lack data on how many Digg visitors proceed to page 2,
3 and so on in each list. However, when presented with lists
over multiple pages on a web site, successively smaller fractions
of users visit later pages in the list. One model of users
following links through a web site considers users estimating
the value of continuing at the site, and leaving when that value
becomes negative [17]. This model leads to an inverse Gaussian
distribution of the number of pages m a user visits before
leaving the web site,

$$ \sqrt{\frac{\lambda}{2\pi m^3}} \exp\left(-\frac{\lambda (m-\mu)^2}{2 m \mu^2}\right) \qquad (3) $$

with mean $\mu$ and variance $\mu^3/\lambda$. This distribution matches
empirical observations in several web settings [17]. When the
variance is small, for intermediate values of m this distribution
approximately follows a power law, with the fraction of
users leaving after viewing m pages decreasing as $m^{-3/2}$.

To model the visibility of a story on the m-th front or upcoming
page, the relevant distribution is the fraction of users
who visit at least m pages, i.e., the upper cumulative distribution
of Eq. (3). For m > 1, this fraction is

$$ f_{\text{page}}(m) = \frac{1}{2}\left(F_m(-\mu) - e^{2\lambda/\mu} F_m(\mu)\right) \qquad (4) $$

where $F_m(x) = \operatorname{erfc}(\alpha_m (m - 1 + x)/\mu)$, erfc is the complementary
error function, and $\alpha_m = \sqrt{\lambda/(2(m-1))}$. For
m = 1, $f_{\text{page}}(1) = 1$.

The visibility of stories decreases in two distinct ways when 
a new story arrives. First, a story moves down the list on its 
current page. Second, a story at the 15th position moves to the
top of the next page. For simplicity, we model these processes
as decreasing visibility, i.e., the value of $f_{\text{page}}(m)$, through
m taking on fractional values within a page, i.e., m = 1.5
denotes the position of a story half way down the list on the 
first page. This model is likely to somewhat overestimate the 
loss of visibility for stories among the first few of the 15 items 
on a given page since the top several stories are visible without 
requiring the user to scroll down the page. 
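The visibility function of Eq. (4) is straightforward to evaluate numerically. The following is a minimal sketch, not the authors' code: the function name `f_page` and its keyword arguments are illustrative, and the default values of μ and λ are the ones the paper reports in Table I. Fractional m is handled exactly as described above.

```python
import math


def f_page(m, mu=0.6, lam=0.6):
    """Fraction of users who view at least m pages (Eq. 4).

    m may be fractional, e.g. m = 1.5 for a story half way down page 1.
    mu and lam are the inverse Gaussian parameters; the defaults are the
    Table I values and are passed explicitly so they can be changed.
    """
    if m <= 1.0:
        return 1.0  # every visitor sees the top of the first page
    alpha = math.sqrt(lam / (2.0 * (m - 1.0)))

    def F(x):
        return math.erfc(alpha * (m - 1.0 + x) / mu)

    return 0.5 * (F(-mu) - math.exp(2.0 * lam / mu) * F(mu))


if __name__ == "__main__":
    for m in (1.0, 1.5, 2.0, 3.0):
        print(m, f_page(m))
```

With the default parameters, `f_page(2.0)` comes out close to the "about 1/6" that the text quotes for the fraction of users viewing more than just the first page.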

b. List position of a story Fig. 5(a) shows how the page
number of a story on the two lists changes in time for three
randomly chosen stories from our data set. The behavior is
close to linear when averaging over the daily activity variation
(shown in Fig. 3). For simplicity in this model, we ignore
this variation and take a story's page number on the upcoming 
page q and the front page p at time t to be 031 



$$ p(t) = k_f (t - T_{\text{promotion}}) + 1 \qquad (5) $$
$$ q(t) = k_u t + 1 \qquad (6) $$

where $T_{\text{promotion}}$ is the time the story is promoted to the front
page (or $\infty$ if the story is never promoted) and the slopes are
given in Table I. For a given story, p(t) is only defined for
times $t \geq T_{\text{promotion}}$ and q(t) for $t < T_{\text{promotion}}$. Since each
page holds 15 stories, these rates are 1/15th the submission
and promotion rates, respectively.

c. Front page and upcoming stories lists Digg prominently
shows the stories on the front page. The upcoming
stories list is less popular than the front page. We model this
fact by assuming a fraction c < 1 of Digg visitors proceed to
the upcoming stories pages.

We use a simple threshold to model how a story is promoted 
to the front page. Initially the story is visible on the upcoming 
stories pages. If and when the number of votes a story receives 
exceeds a promotion threshold h, the story moves to the front 
page. This threshold model approximates Digg's promotion 
algorithm as of May 2006, since in our data set we did not see 
any front page stories with fewer than 44 votes, nor did we 
see any upcoming stories with more than 42 votes. We take 
h = 40 as an approximation to the promotion algorithm. 

d. Friends interface The friends interface allows the 
user to see the stories her friends have (i) submitted, (ii) voted 
for, and (iii) commented on in the preceding 48 hours. Al- 
though users can take advantage of all these features, we only 
consider the first two. These uses of the friends interface are 
similar to the functionality offered by other social media sites: 
e.g., Flickr allows users to see the latest images their friends
uploaded, as well as the images a friend liked.

The fans of the story's submitter can find the story via the 
friends interface. As additional people vote on the story, their 
fans can also see the story. We model this with s(t), the num- 
ber of fans of voters on the story by time t who have not yet 
seen the story. Although the number of fans is highly variable,
the average number of additional fans from an extra vote
when the story has $N_{\text{vote}}$ votes is approximately

$$ \Delta s = a N_{\text{vote}}^{-b} \qquad (7) $$

where a = 51 and b = 0.62, as illustrated in Fig. 5(b), showing
the fit to the increment in average number of fans per vote
over groups of 5 votes as given in the data. Thus early voters
on a story tend to have more new fans (i.e., fans who are not 
also fans of earlier voters) than later voters. 

The model can incorporate any distribution for the times 
fans visit Digg. We suppose these users visit Digg daily, and 
since they are likely to be geographically distributed across 
all time zones, the rate fans discover the story is distributed 
throughout the day. A simple model of this behavior takes
fans arriving at the friends page independently at a rate $\omega$. As
fans read the story, the number of potential voters gets smaller,
i.e., s decreases at a rate $\omega s$, corresponding to the rate fans
find the story through the friends interface, $\nu_{\text{friends}}$. We neglect
additional reduction in s from fans finding the story without
using the friends interface.

Combining the growth in the number of available fans and
its decrease as fans return to Digg gives

$$ \frac{ds}{dt} = -\omega s + a N_{\text{vote}}^{-b} \frac{dN_{\text{vote}}}{dt} \qquad (8) $$

with initial value s(0) equal to the number of fans of the
story's submitter, S. This model of the friends interface treats
the pool of fans uniformly. That is, we assume no difference in
behavior, on average, for fans of the story's submitter vs. fans
of other voters.







FIG. 5: (a) Current page number on the upcoming and front pages vs. time for three different stories. Time is measured from when the story
first appeared on each page, i.e., time it was submitted or promoted, for the upcoming and front page points, respectively. (b) Increase in the
number of distinct users who can see the story through the friends interface with each group of five new votes for the first 46 users to vote on
a story. The points are mean values for 195 stories, including those shown in (a), and the curve is based on Eq. (7). The error bars indicate the
standard error of the estimated means.



parameter                            value
rate general users come to Digg      $\nu = 600$ users/hr
fraction viewing upcoming pages      $c = 0.3$
rate a voter's fans come to Digg     $\omega = 0.12$/hr
page view distribution               $\mu = 0.6$, $\lambda = 0.6$
fans per new vote                    $a = 51$, $b = 0.62$
vote promotion threshold             $h = 40$
upcoming stories location            $k_u = 3.60$ pages/hr
front page location                  $k_f = 0.18$ pages/hr

story specific parameters
interestingness                      r
number of submitter's fans           S

TABLE I: Model parameters.



In summary, the rates in Eq. (2) are [40]:

$$ \nu_f = \nu\, f_{\text{page}}(p(t))\, \Theta(N_{\text{vote}}(t) - h) $$
$$ \nu_u = c\, \nu\, f_{\text{page}}(q(t))\, \Theta(h - N_{\text{vote}}(t))\, \Theta(24\,\text{hr} - t) $$
$$ \nu_{\text{friends}} = \omega\, s(t) $$

where t is time since the story's submission and $\nu$ is the rate
users visit Digg. The first step function in $\nu_f$ and $\nu_u$ indicates
that when a story has fewer votes than required for promotion,
it is visible in the upcoming stories pages; and when
$N_{\text{vote}}(t) \geq h$ the story is visible on the front page. The second
step function in $\nu_u$ accounts for a story staying in the upcoming
list for at most 24 hours. We solve Eq. (2) subject to initial
condition $N_{\text{vote}}(0) = 1$, because a newly submitted story
starts with a single vote, from the submitter.
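To make the structure of the model concrete, the rate equations (2) and (8), together with the visibility rates summarized above, can be integrated numerically. The sketch below is only an illustration under stated assumptions: it uses a plain forward-Euler step (the paper does not say which solver was used), treats the story as promoted as soon as $N_{\text{vote}}$ reaches h, and the function and constant names are hypothetical. Parameter values are those of Table I.

```python
import math

# Parameter values from Table I.
NU, C, OMEGA = 600.0, 0.3, 0.12   # visit rate (users/hr), upcoming fraction, fan visit rate (1/hr)
MU, LAM = 0.6, 0.6                # page view distribution
A, B, H = 51.0, 0.62, 40          # fans per new vote (a, b), promotion threshold h
K_U, K_F = 3.60, 0.18             # list slopes (pages/hr)


def f_page(m):
    """Eq. (4): fraction of users who view at least m pages."""
    if m <= 1.0:
        return 1.0
    alpha = math.sqrt(LAM / (2.0 * (m - 1.0)))

    def F(x):
        return math.erfc(alpha * (m - 1.0 + x) / MU)

    return 0.5 * (F(-MU) - math.exp(2.0 * LAM / MU) * F(MU))


def simulate_votes(r, S, hours=72.0, dt=0.01):
    """Forward-Euler integration of Eqs. (2) and (8).

    r: interestingness, S: number of fans of the submitter.
    Returns (votes at the end, promotion time in hours or None).
    """
    n_vote, s, t, t_promo = 1.0, float(S), 0.0, None
    while t < hours:
        if t_promo is None and n_vote >= H:
            t_promo = t  # assumption: promotion happens when N_vote reaches h
        if t_promo is not None:
            nu_f = NU * f_page(K_F * (t - t_promo) + 1.0)   # front page, Eq. (5)
            nu_u = 0.0
        else:
            nu_f = 0.0
            # upcoming list, Eq. (6); stories leave the list after 24 hours
            nu_u = C * NU * f_page(K_U * t + 1.0) if t < 24.0 else 0.0
        dn = r * (nu_f + nu_u + OMEGA * s) * dt            # Eq. (2)
        s += -OMEGA * s * dt + A * n_vote ** (-B) * dn     # Eq. (8)
        n_vote += dn
        t += dt
    return n_vote, t_promo


if __name__ == "__main__":
    for r, S in ((0.5, 5), (0.2, 40), (0.1, 5)):
        print(r, S, simulate_votes(r, S))
```

A smaller `dt` or a higher-order integrator would be the natural refinement; Euler is used here only to keep the correspondence with the equations visible.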



B. Model Parameters 

The solutions of Eq. (2) show how the number of votes received
by a story changes in time. The solutions depend on
the model parameters, of which only two parameters, the
story's interestingness r and number of fans the submitter has
S, change from one story to another. Therefore, we fix values
of the remaining parameters as given in Table I.

As described above, we estimate some of these parameters 
(such as the growth in list location, promotion threshold and 
fans per new vote) directly from the data. The remaining pa- 
rameters are not directly given by our data set (e.g., how of- 
ten users view the upcoming pages) and instead we estimate 
them based on the model predictions. The small number of 
stories in our data set, as well as the approximations made in 
the model, do not give strong constraints on these parameters. 
We selected one set of values giving a reasonable match to 
our observations. For example, the rate fans visit Digg and
view stories via the friends interface, given by $\omega$ in Table I,
has 90% of the fans of a new voter returning to Digg within
the next 19 hours. As another example of interpreting these
parameter values, for the page visit distribution the values of
$\mu$ and $\lambda$ in Table I correspond to about 1/6 of the users viewing
more than just the first page. These parameters could in
principle be measured independently from aggregate behavior 
with more detailed information on user behavior. Measuring 
these values for users of Digg, or other similar web sites, could 
improve the choice of model parameters. 
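The interpretation of ω quoted above follows from the assumption that fans arrive independently at a constant rate, so the waiting time until a fan's next visit is exponentially distributed. A short check, with an illustrative function name that is not from the paper:

```python
import math


def fraction_returned(omega, hours):
    """Fraction of fans who have visited at least once within `hours`,
    assuming independent arrivals at constant rate omega (per hour)."""
    return 1.0 - math.exp(-omega * hours)


if __name__ == "__main__":
    # omega = 0.12/hr is the Table I value; 19 hours is the horizon quoted in the text.
    print(fraction_returned(0.12, 19.0))
```

For ω = 0.12/hr and 19 hours this evaluates to roughly 0.9, consistent with the 90% figure in the text.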



C. Results 

The model describes the behavior of all stories, whether
or not they are promoted to the front page. To illustrate the
model results, we consider stories promoted to the front page.
Fig. 6 shows the behavior of six stories. For each story, S
is the number of fans of the story's submitter, available from
our data, and r is estimated to minimize the root-mean-square
(RMS) difference between the observed votes and the model
predictions. Table II lists these values.

FIG. 6: Evolution of the number of votes received by six stories
compared with model solution.

S      r      final votes
5      0.51   2229
5      0.44   1921
40     0.32   1297
40     0.28   1039
160    0.19   740
100    0.13   458

TABLE II: Parameters for the example stories, listed in decreasing
order of total votes received by the story and hence corresponding to
the curves in Fig. 6 from top to bottom.

Overall there is qualitative agreement between the data and 
the model, indicating that the features of the Digg user in- 
terface we considered can explain the patterns of collective 
voting. Specifically, the model reproduces three generic be- 
haviors of Digg stories: (1) slow initial growth in votes of 
upcoming stories; (2) more interesting stories (higher r) are 
promoted to the front page (inflection point in the curve) faster 
and receive more votes than less interesting stories; (3) however,
as first described in [23], better connected users (high
S) are more successful in getting their less interesting stories 
(lower r) promoted to the front page than poorly-connected 
users. These observations highlight a benefit of the stochastic 
approach: identifying simple models of user behavior that are 
sufficient to produce the aggregate properties of interest. 

The only significant difference between the data and the
model is visible in the lower two lines of Fig. 6. In the data,
a story posted by the user with S = 100 is promoted before
the story posted by the user with S = 160, but saturates at
a smaller number of votes than the latter story. In the model, the
story with larger r is promoted first and gets more votes.

Thus while the stochastic model is primarily intended to
describe typical story behavior, we see it gives a reasonable
match to the actual vote history of individual stories. Nevertheless,
there are some cases where individual stories differ
considerably from the model, particularly where an early
voter happens to have an exceptionally large number of fans,
thereby increasing the story's visibility to other users far more
than the average number of new fans per vote. This variation,
a consequence of the long-tail distributions involved in
social media, is considerably larger than seen, for example, in
most statistical physics applications of stochastic models. The
effect of such large variations is an important issue for addressing
the usefulness of the stochastic modeling approach
for social media when applied to the behavior of individual
stories.

FIG. 7: Story promotion as a function of S and r. The r values are
shown on a logarithmic scale. The model predicts stories above the
curve are promoted to the front page. The points show the S and
r values for the stories in our data set: black and gray for stories
promoted or not, respectively.

FIG. 8: Distribution of interestingness (i.e., r values) for the promoted
stories in our data set compared with the best fit lognormal
distribution.

Fig. 7 shows parameters required for a story to reach the
front page according to the model, and how that prediction
compares to the stories in our data set. The model's prediction
of whether a story is promoted is correct for 95% of the stories
in our data set. For promoted stories, the correlation between
S and r is −0.13, which is significantly different from zero
(p-value less than $10^{-4}$ by a randomization test). Thus a story
submitted by a poorly connected user (small S) tends to need
high interest (large r) to be promoted to the front page [23].

Figure 8 shows that the estimated r values for the 510 promoted
stories in our data set have a wide range of interestingness
to users. That is, even after accounting for the variation in
visibility of the stories, there remains a significant range in
how well stories appeal to users. Specifically, Fig. 9 shows
these r values fit well to a lognormal distribution

$$ P_{\text{lognormal}}(\mu, \sigma; r) = \frac{1}{\sqrt{2\pi}\, r \sigma} \exp\left(-\frac{(\mu - \log(r))^2}{2\sigma^2}\right) \qquad (9) $$

where parameters $\mu$ and $\sigma$ are the mean and standard deviation
of log(r). For the distribution of interestingness values,
the maximum likelihood estimates of the mean and standard
deviation of log(r) are −1.67 ± 0.04 and 0.47 ± 0.03,
respectively, with the ranges giving the 95% confidence intervals.
A randomization test based on the Kolmogorov-Smirnov
statistic and accounting for the fact that the distribution parameters
are determined from the data shows the r values
are consistent with this distribution (p-value 0.35). While
broad distributions occur in several web sites [37], our model
allows factoring out the effect of visibility due to the user interface
from the overall distribution of votes. Thus we can
identify variation in users' inclination to vote on a story they
see.

[Figure 9 axes: r estimates for promoted stories vs. quantile of lognormal]

FIG. 9: Quantile-quantile plot comparing observed distribution of r
values with the lognormal distribution fit (thick curve). For comparison,
the thin straight line from 0 to 1 corresponds to a perfect match
between the data and the distribution.
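For a lognormal distribution the maximum likelihood estimates of μ and σ are simply the sample mean and the (biased, divide-by-n) standard deviation of log(r). The sketch below illustrates that step only; the function names are illustrative, and the six r values in the usage example are the Table II examples, not the full set of 510 promoted stories, so the output will not reproduce the estimates quoted above.

```python
import math


def fit_lognormal(values):
    """Maximum likelihood estimates (mu, sigma) of a lognormal distribution:
    the mean and the divide-by-n standard deviation of the log values."""
    logs = [math.log(v) for v in values]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / n)
    return mu, sigma


def lognormal_pdf(r, mu, sigma):
    """Density of Eq. (9)."""
    z = (mu - math.log(r)) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * r * sigma)


if __name__ == "__main__":
    r_examples = [0.51, 0.44, 0.32, 0.28, 0.19, 0.13]  # r column of Table II
    mu, sigma = fit_lognormal(r_examples)
    print(mu, sigma, lognormal_pdf(0.2, mu, sigma))
```

The confidence intervals and the Kolmogorov-Smirnov randomization test reported in the text require resampling and are not covered by this sketch.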

The simple model described in this section gives a reasonable
qualitative account of how user behavior leads to stories'
promotion to the front page and the eventual saturation in the
number of votes they receive due to their decreasing visibility.
In the section below we show how additional properties of the
interface and user population can be added to the model for a
more accurate analysis of the aggregate behavior. For example,
submitter's fans may find the story more interesting than
the general Digg audience, corresponding to different r values
for these groups of users. In addition, we modeled users
coming to Digg independently with uniform rates $\nu$ and $\omega$. In
fact, the rates vary systematically over hours and days [35],
as shown in Fig. 3, and individual users have a wide range in
time between visits [36]. In our model, this variation gives
time-dependent values for $\nu$, describing the rate users come to
Digg, and $k_f$ and $k_u$, which relate to the rate new stories are
posted and promoted.




FIG. 10: State diagram for a user. The submitter provides a story's 
first vote. The initial set of fans consists of the submitter's fans; 
other users are initially non-fans. Fans and non-fans have different 
probabilities to see and vote on the story. With each vote, a non- 
fan user who is a fan of that voter moves into the fans state. This 
state transition is caused by the votes of other users: a user moving 
from the non-fans to fans state is not aware of that change until later 
visiting Digg and seeing the story in the friends interface. 



[Figure 11 diagram labels: submission; upcoming (location: q); P(v); front page (location: p)]

FIG. 11: State diagram for a story. A story starts at the top of the
upcoming pages, with location q = 1. The location increases with
each new submission. An upcoming story with v votes is promoted
with probability P(v). A promoted story starts at the top of the front
pages, with location p = 1. The location increases as more stories
are promoted. A story not promoted within a day is removed (not
shown).


The ability of the stochastic approach to incorporate addi- 
tional details in the user models illustrates its value in provid- 
ing insights into how aggregate behavior arises from the users, 
in contrast to models that evaluate regularities in the aggre- 
gate behaviors [38]. In particular, user models can help dis- 
tinguish aggregate behaviors arising from intrinsic properties 
of the stories (e.g., their interestingness to the user population) 
from behavior due to the information the web site provides,
such as ratings of other users and how stories are placed in the 
site, i.e., visibility. Finally, stochastic models have not only 
explanatory, but also predictive power. 



V. A MODEL OF SOCIAL VOTING WITH NICHE 
INTERESTS 

To investigate differences among voters with respect to the 
friends network, we extend the previous stochastic model to 
distinguish votes from fans and non-fans. The model consid- 
ers the joint behavior of users and the location of the story on 



the web site. Fig. 10 shows the user states and the stochastic
transitions between them. Stories are on either the upcoming
or front pages, as shown in Fig. 11. This leads to a description
of the average rates of growth for votes from fans and
non-fans of prior voters, $v_F$ and $v_N$, respectively:

$$ \frac{dv_F}{dt} = \omega\, r_F P_F F \qquad (10) $$
$$ \frac{dv_N}{dt} = \omega\, r_N P_N N \qquad (11) $$

where t is the Digg time since the story's submission and $\omega$
is the average rate a user visits Digg (measured as a rate per
unit Digg time). $v_N$ includes the story's submitter. $P_F$ and
$P_N$ denote the story's visibility and $r_F$ and $r_N$ denote the
story's interestingness to users who are fans or not of prior
voters, respectively. Visibility depends on the story's state
(e.g., whether it has been promoted), as discussed below.
Interestingness is the probability a user who sees the story will
vote on it. Nominally people become fans of those whose
contributions they consider interesting, suggesting fans likely
have a systematically higher interest in stories. Our model
accounts for this possibility with separate interestingness values
for fans and non-fans.

In contrast to the model of Sec. IV, where time t denoted
real time since story submission, we now use t to denote
the "Digg time" since submission, thereby accounting for the
daily variation in activity. Using Digg time reduces the variation
in the rate users visit Digg, thereby improving the match
to the assumed constant rate $\omega$ used in the model. Moreover, a
detailed examination of the page locations of the stories in our
data set shows systematic variation in the time stories spend
on each page corresponding to the daily activity variation used
to define Digg time. Thus using Digg time improves the accuracy
of the linear growth in location given in Eqs. (5) and (6).

These voting rates depend on F and N, the numbers of users
who have not yet seen the story and who are, respectively, fans
and non-fans of prior voters. The quantities change as users see
and vote on the story according to

$$ \frac{dF}{dt} = -\omega P_F F + \rho N \frac{dv}{dt} \qquad (12) $$
$$ \frac{dN}{dt} = -\omega P_N N - \rho N \frac{dv}{dt} \qquad (13) $$

with $v = v_F + v_N$ the total number of votes the story has
received. The quantity $\rho$ is the probability a user who has not
yet seen the story and is not a fan of a prior voter is a fan of
the most recent voter. For simplicity, we treat this probability
as a constant over the voters, thus averaging over the variation
due to clustering in the social network and the number of fans
a user has. The first term in each of these equations is the rate
the users see the story. The second terms arise from the rate
the story becomes visible in the friends interface of users who
are not fans of previous voters but are fans of the most recent
voter.
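As an illustration of how Eqs. (10) to (13) fit together, a single forward-Euler update of the four quantities can be written as follows. This is a sketch under stated assumptions, not the authors' implementation: the function name, the tuple layout of the state, and the choice of Euler stepping are all illustrative, and the visibilities $P_F$ and $P_N$ are passed in as plain numbers rather than computed from the story's list position.

```python
def step(state, dt, omega, r_F, r_N, P_F, P_N, rho):
    """One forward-Euler step of Eqs. (10)-(13).

    state = (v_F, v_N, F, N):
      v_F, v_N  votes so far from fans and non-fans of prior voters
      F, N      users who have not yet seen the story (fans, non-fans)
    """
    v_F, v_N, F, N = state
    dv_F = omega * r_F * P_F * F * dt              # Eq. (10)
    dv_N = omega * r_N * P_N * N * dt              # Eq. (11)
    dv = dv_F + dv_N
    dF = -omega * P_F * F * dt + rho * N * dv      # Eq. (12)
    dN = -omega * P_N * N * dt - rho * N * dv      # Eq. (13)
    return (v_F + dv_F, v_N + dv_N, F + dF, N + dN)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only: S = 10 fans of the
    # submitter and U = 1011 active users, so F = S and N = U - S - 1.
    state = (0.0, 1.0, 10.0, 1000.0)
    state = step(state, dt=0.1, omega=0.1, r_F=0.5, r_N=0.1,
                 P_F=1.0, P_N=0.01, rho=0.001)
    print(state)
```

Note that the two ρ terms cancel in F + N: a new vote moves users from the non-fan pool to the fan pool without changing the total number of users who have yet to see the story.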

Initially, the story has one vote (from the submitter) and the
submitter has S fans, so $v_F(0) = 0$, $v_N(0) = 1$, $F(0) = S$
and $N(0) = U - S - 1$, where U is the total number of active
users at the time the story is submitted. Over time, a story
becomes less visible to users as it moves down the upcoming
or (if promoted) front page lists, thereby attracting fewer votes
and hence fewer new fans of prior voters.

FIG. 12: Comparison of activity (number of votes) and number of
fans for each of the 3436 users with at least one vote and one fan.



We use the same visiting rate parameter, $\omega$, for users who
are and are not fans of prior voters since there is only a
small correlation between voting activity and the number of
fans across all the stories in our data set, as illustrated in
Fig. 12. Moreover, many highly active users do not participate
in the social network at all (i.e., have neither fans nor
friends). Among all users, the correlation between number of
votes and number of fans is 0.15. More specifically, we assume
that with respect to votes on a single story, fans of those voters
aren't systematically more likely to visit Digg than other
users, such as fans of voters on other stories or users without
fans or friends.



A. Story Visibility 

We assume a fan easily sees the story via the friends interface,
so $P_F = 1$, as in the previous model [15]. Users who
are not fans of prior voters must find the story on the front
or upcoming pages. Thus $P_N$ depends on how users navigate
through these pages and the story's location at the time the
user visits Digg. As with the previous model, we use Eq. (3) to
describe this behavior.

e. List position of a story The page number of a story
on the upcoming page q and the front page p at time t is
given by Eqs. (5) and (6), with t now interpreted as Digg time.
The slopes, given in Table III, are the same as with the previous
model which averaged over the daily variation in activity.
Since each page holds 15 stories, these rates are 1/15th the
story submission and promotion rates, respectively.

Since upcoming stories are less popular than the front page,
our model has a fraction c < 1 of Digg visitors viewing the
upcoming stories pages. Combining these effects, we take the
visibility of a story at position p in the front page list to be
$P_N = f_{\text{page}}(p)$, whereas a story at position q in the upcoming
page list is $c\, f_{\text{page}}(q)$ 021.

f. Promotion to the front page Promotion to the front
page appears to depend mainly on the number of votes the
story receives. We model this process by the probability P(v)
an upcoming story is promoted after its v-th vote. We take
P(1) = 0, i.e., a story is not promoted just based on the submitter's
vote. The probability a story is not promoted by the
time it receives v votes is $\prod_{i=1}^{v} (1 - P(i))$. Stories not promoted
are eventually removed, typically 24 hours after submission.

FIG. 13: Probability for promotion before the next vote for an upcoming
story as a function of the number of votes. The error bars
indicate the 95% confidence intervals for the estimates. The curve is
a logistic fit.



Based on our data, Fig. 13 shows the probability P(v) an
upcoming story is promoted after v votes conditioned on it
not having been promoted earlier. We find a significant spread
in the number of votes a story has when it is promoted. For
predicting whether and when a story will be promoted in our
model, we use a logistic regression fit to these values, as
shown in the figure. This contrasts with the step function for
promotion at 40 votes used in the previous model [15].
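The relation between the per-vote promotion probability P(v) and the probability that a story is still unpromoted after v votes can be sketched as below. The logistic coefficients `b0` and `b1` are hypothetical placeholders chosen only so the curve crosses 0.5 at 40 votes; the fitted values from the paper's logistic regression are not given in this section and are not reproduced here.

```python
import math


def promotion_prob(v, b0=-8.0, b1=0.2):
    """Logistic model for P(v), the probability an upcoming story is promoted
    after its v-th vote. b0 and b1 are hypothetical, not the paper's fit."""
    if v <= 1:
        return 0.0  # P(1) = 0: the submitter's vote alone never promotes
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * v)))


def prob_not_promoted(v, p=promotion_prob):
    """Probability the story is still unpromoted after v votes:
    the product over i = 1..v of (1 - P(i))."""
    survive = 1.0
    for i in range(1, v + 1):
        survive *= 1.0 - p(i)
    return survive


if __name__ == "__main__":
    for v in (1, 20, 40, 60):
        print(v, promotion_prob(v), prob_not_promoted(v))
```

Passing a different callable as `p` makes it easy to compare the logistic form with the step function at 40 votes used in the earlier model.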

g. Friends interface The fans of the story's submitter
can find the story via the friends interface. As additional people
vote on the story, their fans can also see the story. We
model this with F(t), the number of fans of voters on the story
by time t who have not yet seen the story. Although the number
of fans is highly variable, we use the average number of
additional fans from an extra vote, $\rho N$, in Eq. (12).



B. Parameter Estimation

Since we observe votes, not visits to Digg, there is some
ambiguity in the rate $\omega$ and the interestingness values $r_F$, $r_N$.
For example, a given value of $\omega r_F$ could arise from users often
visiting Digg but rarely voting on stories, or less frequent
visits with a high chance of voting during each visit. This arbitrary
scaling does not affect our focus on the relative behavior
of fans and non-fans. For definiteness, we pick a specific value
for $\omega$ and give interestingness values relative to that choice.

We used the May data to estimate the story location parameters
$k_u$ and $k_f$. Their values correspond to 54 and 2.7 stories
per hour submitted and promoted, respectively.



1. Estimating parameters from observed votes

In our model, story location affects visibility only for non-fan
voters since fans of prior voters see the story via the
friends interface. Thus we use just the non-fan votes to estimate
visibility parameters, via maximum likelihood. Specifically,
we use the non-fan votes for 16 stories in the June data
set to estimate c and the "law of surfing" parameters $\mu$ and $\lambda$.
We then use fan votes for these stories to evaluate the probability
a user is a fan of a new voter, $\rho$. Separating votes by the
different interfaces by which users find stories provides more
precise estimation than the prior model [15].

This estimation involves comparing the observed votes to
the voting rate from the model. As described above, the model
uses rate equations to determine the average behavior of the
number of votes. A simple approach to relate this average
to the observed number of votes is to assume the votes from
non-fan users form a Poisson process whose expected value
is $dv_N(t)/dt$, given by Eq. (11). This rate changes with time
and depends on the model parameters.

For a Poisson process with a constant rate $\nu$, the probability
to observe n events in time T is the Poisson distribution
$e^{-\nu T} (\nu T)^n / n!$. This probability depends only on the number
of events, not the specific times at which they occur. Thus
estimating a constant rate involves maximizing this expression,
which gives $\nu = n/T$, i.e., the maximum-likelihood estimate
of the rate for a constant Poisson process is equal to the average
rate of the observed events.

In our case, the voting rate changes with time, requiring a
generalization of this estimation. Specifically consider a Poisson
process with nonnegative rate $\nu(t)$ which depends on one
or more parameters to be estimated. Thus in a small time interval
$(t, t + \Delta t)$, the probability for a vote is $\nu(t)\Delta t$, and
this is independent of votes in other time intervals, by the definition
of a Poisson process. Suppose we observe n votes at
times $0 < t_1 < t_2 < \ldots < t_n < T$ during an observation time
interval (0, T). Considering small time intervals $\Delta t$ around
each observation, the probability of this observation is

$$ P(\text{no vote in } (0, t_1))\,\nu(t_1)\Delta t \times P(\text{no vote in } (t_1, t_2))\,\nu(t_2)\Delta t \times \cdots \times P(\text{no vote in } (t_{n-1}, t_n))\,\nu(t_n)\Delta t \times P(\text{no vote in } (t_n, T)) $$

The probability for no vote in the interval (a, b) is

$$ \exp\left(-\int_a^b \nu(t)\,dt\right) $$

Thus the log-likelihood for the observed sequence of votes is

$$ -\int_0^T \nu(t)\,dt + \sum_i \log \nu(t_i) $$

The maximum-likelihood estimation for parameters determining
the rate $\nu(t)$ is a trade-off between these two terms:
attempting to minimize $\nu(t)$ over the range (0, T) to increase
the first term while maximizing the values $\nu(t_i)$ at the specific
times of the observed votes. If $\nu(t)$ is constant, this likelihood
expression simplifies to $-\nu T + n \log \nu$ with maximum
at $\nu = n/T$ as discussed above for the constant Poisson process.
When $\nu(t)$ varies with time, the maximization selects
parameters giving relatively larger $\nu(t)$ values where the observed
votes are clustered in time.

In our case, we combine this log-likelihood expression from
the votes on several stories, and maximize the combined expression
with respect to the story-independent parameters of
the model, with the interestingness parameters determined
separately for each story.
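The log-likelihood of a time-varying Poisson process described above can be evaluated numerically for any candidate rate function. The following is a minimal sketch with illustrative names; the trapezoid rule for the integral and the example vote times are assumptions made for the illustration, and the constant term involving Δt is dropped because it does not depend on the parameters.

```python
import math


def log_likelihood(rate, vote_times, T, steps=10000):
    """Log-likelihood of votes at `vote_times` within (0, T) for a Poisson
    process with time-dependent rate function `rate(t)`:
        -integral_0^T rate(t) dt + sum_i log rate(t_i)
    The integral is approximated with the trapezoid rule."""
    h = T / steps
    integral = 0.5 * (rate(0.0) + rate(T))
    for i in range(1, steps):
        integral += rate(i * h)
    integral *= h
    return -integral + sum(math.log(rate(t)) for t in vote_times)


if __name__ == "__main__":
    times = [1.0, 2.0, 5.0, 7.0]   # hypothetical vote times (hours)
    T = 10.0
    # For a constant rate the maximum is at n/T = 0.4.
    for c in (0.3, 0.4, 0.5):
        print(c, log_likelihood(lambda t, c=c: c, times, T))
```

In practice `rate` would be the model's $dv_N/dt$ for a given parameter set, and the sum of such log-likelihoods over stories would be handed to a numerical optimizer.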




2. Estimating number of active users



Our model involves a population of "active users" who visit
Digg during our sample period. Specifically, the model uses
the rate users visit Digg, $\omega U$. We do not observe visits in
our data, but can infer the relevant number of active users, U,
from the heterogeneity in the number of votes by users. The
June data set consists of 16283 users who voted at least once
during the sample period. Fig. 14 shows the distribution of
this activity on front page stories. Most users have little activity
during the sample period, suggesting a large fraction of
users vote infrequently enough to never have voted during the
time of our data sample. This behavior can be characterized
by an activity rate for each user. A user with activity rate $\nu$
will, on average, vote on $\nu T$ stories during a sample time T.
We model the observed votes as arising from a Poisson process
whose expected value is $\nu T$ and the heterogeneity arising
from a lognormal distribution of user activity rates [16]. This
model gives rise to the extended activity distribution while
accounting for the discrete nature of the observations. The latter
is important for the majority of users who have low activity
rates so will vote only a few times, or not at all, during our
sample period.

Specifically, for $n_k$ users with k votes during the sample
period, this mixture of lognormal and Poisson distributions [6, 30]
gives the log-likelihood of the observations as

$$ \sum_k n_k \log P(\mu, \sigma; k) $$

where $P(\mu, \sigma; k)$ is the probability of a Poisson distribution
to give k votes when its mean is chosen from a lognormal
distribution $P_{\text{lognormal}}$ with parameters $\mu$ and $\sigma$. From Eq. (9),

$$ P(\mu, \sigma; k) = \frac{1}{\sqrt{2\pi}\,\sigma\,k!} \int_0^\infty \rho^{k-1} \exp\left(-\rho - \frac{(\log(\rho) - \mu)^2}{2\sigma^2}\right) d\rho $$

for integer $k \geq 0$. We evaluate this integral numerically. In
terms of our model parameters, the value of $\rho$ in this distribution
equals $\nu T$.

Since we don't observe the number of users who did not
vote during our sample period, i.e., the value of $n_0$, we cannot
maximize this log-likelihood expression directly. Instead,
we use a zero-truncated maximum likelihood estimate [11] to
determine the parameters $\mu$ and $\sigma$ for the vote distribution of
Fig. 14. Specifically, the fit is to the probability of observing
$k$ votes conditioned on observing at least one vote. This
conditional distribution is $P(\mu, \sigma; k)/(1 - P(\mu, \sigma; 0))$ for $k > 0$,
and the corresponding log-likelihood is

$$\sum_{k>0} n_k \log P(\mu, \sigma; k) - U_+ \log(1 - P(\mu, \sigma; 0))$$

where $U_+$ is the number of users with at least one vote in our
sample, i.e., 16283. Maximizing this expression with respect
to the distribution's parameters $\mu$ and $\sigma$ gives $\nu T$ lognormally
distributed with the mean and standard deviation of $\log(\nu T)$
equal to $-2.06 \pm 0.03$ and $1.82 \pm 0.03$, respectively. With
these parameters, $P(\mu, \sigma; 0) = 0.757$, indicating about 3/4 of
the users had sufficiently low, but nonzero, activity rate that
they did not vote during the sample period. We use this value
to estimate $U$, the number of active users during our sample
period: $U = U_+/(1 - P(\mu, \sigma; 0))$.

FIG. 14: User activity distribution on logarithmic scales. The curve
shows the fit to the model described in the text.



Based on this fit, the curve in Fig. 14 shows the expected
number of users with each number of votes, i.e., the value of
$U P(\mu, \sigma; k)$ for $k > 0$. This is a discrete distribution: the
lines between the expected values serve only to distinguish
the model fit from the points showing the observed values. A
bootstrap test [12] based on the Kolmogorov-Smirnov (KS)
statistic shows the vote counts are consistent with this
distribution (p-value 0.48). This test and the others reported in this
paper account for the fact that we fit the distribution
parameters to the data [8].
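
The Poisson-lognormal probability and the resulting estimate of $U$ can be sketched numerically as follows. This is a minimal illustration, not the procedure used for the fits reported here: the function names, the substitution $x = \log(\rho)$, and the trapezoidal rule are assumptions made for the example.

```python
import math

def poisson_lognormal_pmf(k, mu, sigma, n_steps=4000, width=12.0):
    """P(mu, sigma; k): chance of k votes when the Poisson mean is lognormal.

    Integrates over x = log(rho) with the trapezoidal rule (an assumed
    numerical scheme chosen for this sketch).
    """
    lo, hi = mu - width * sigma, mu + width * sigma
    h = (hi - lo) / n_steps
    log_k_factorial = math.lgamma(k + 1)
    total = 0.0
    for i in range(n_steps + 1):
        x = lo + i * h
        log_poisson = k * x - math.exp(x) - log_k_factorial
        log_gauss = -0.5 * ((x - mu) / sigma) ** 2
        weight = 0.5 if i in (0, n_steps) else 1.0
        total += weight * math.exp(log_poisson + log_gauss)
    return total * h / (sigma * math.sqrt(2.0 * math.pi))

def truncated_log_likelihood(counts, mu, sigma):
    """Zero-truncated log-likelihood; counts maps k > 0 to n_k."""
    p0 = poisson_lognormal_pmf(0, mu, sigma)
    u_plus = sum(counts.values())
    ll = sum(n * math.log(poisson_lognormal_pmf(k, mu, sigma))
             for k, n in counts.items())
    return ll - u_plus * math.log(1.0 - p0)

def estimate_active_users(u_plus, mu, sigma):
    """U = U_+ / (1 - P(mu, sigma; 0))."""
    return u_plus / (1.0 - poisson_lognormal_pmf(0, mu, sigma))
```

With the fitted values quoted above, `estimate_active_users(16283, -2.06, 1.82)` gives a value of the same order as the rounded $U$ listed in Table III. A full fit would maximize `truncated_log_likelihood` over `mu` and `sigma` with a numerical optimizer, which is omitted here.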



3. Estimated parameters 

Table III lists the estimated parameters. We estimate $r_F$ and
$r_N$ for each story from its fan and non-fan votes.

The page view distribution seen in this data set indicates
users who choose to visit the upcoming pages tend to explore
those pages fairly deeply. This contrasts with the more limited
exploration, i.e., smaller value of $\mu$, seen in the May data set
which included votes well after promotion [15]. This suggests
differing levels of perseverance of users who visit the upcom- 
ing stories compared to the majority of users who focus on 
front page stories. Alternatively, there could be other ways 
non-fan users find content that has already moved far down 
the list of stories. 
parameter                               value

average rate each user visits Digg      $\omega$ = 0.2/hr
number of active users                  $U$ = 70,000
fraction viewing upcoming pages         $c$ = 0.065
page view distribution                  $\mu$ = 6.3, $\lambda$ = 0.14
probability a user is a voter's fan     $p$ = 9.48 x 10^-6
upcoming stories location               $k_u$ = 3.60 pages/hr
front page location                     $k_f$ = 0.18 pages/hr

story specific parameters

interestingness to fans                 $r_F$
interestingness to non-fans             $r_N$
number of submitter's fans              $S$

TABLE III: Model parameters, with times in "Digg hours"




FIG. 15: Voting behavior: the number of votes vs. time, measured in 
Digg hours, for a promoted story in June 2006. The curve shows the 
corresponding solution from our model and the dashed vertical line 
indicates when the story was promoted to the front page. This story 
eventually received 2566 votes. 



C. Results 



Figure 15 compares the solution of the rate equations with 
the actual votes for one story. The model correctly reproduces 
the dynamics of voting while the story is on the upcoming 
stories list and immediately after promotion. 

We use the model to evaluate systematic differences in story
interestingness between fans and non-fans, with the resulting
distribution of values shown in Fig. 16. The interestingness
values for fans and non-fans of prior voters each have a wide
range of values, but the interestingness to fans is generally
much higher than to non-fans. Both sets of values fit well to
lognormal distributions, as indicated in Fig. 17. Specifically,
the $r_N$ values fit well to a lognormal distribution with maximum
likelihood estimates of the mean and standard deviation
of $\log(r_N)$ equal to $-4.0 \pm 0.1$ and $0.63 \pm 0.07$, respectively,
with the ranges giving the 95% confidence intervals. A bootstrap
test based on the KS statistic shows the $r_N$ values are
consistent with this distribution (p-value 0.1).
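
For a lognormal distribution, the maximum likelihood estimates are simply the mean and standard deviation of the logarithms of the values. The sketch below illustrates this; the function name is hypothetical, and the 95% half-widths use the usual large-sample normal approximation, which is an assumption of this example rather than a statement of how the intervals above were obtained.

```python
import math

def fit_lognormal(values):
    """Return (mu, sigma, half_width_mu, half_width_sigma) for positive values.

    mu and sigma are the maximum likelihood estimates of the mean and
    standard deviation of log(value). The half-widths are approximate 95%
    intervals from large-sample theory (an assumption of this sketch).
    """
    logs = [math.log(v) for v in values]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / n)
    half_width_mu = 1.96 * sigma / math.sqrt(n)
    half_width_sigma = 1.96 * sigma / math.sqrt(2 * n)
    return mu, sigma, half_width_mu, half_width_sigma
```

As a rough consistency check, a standard deviation of 0.63 over 178 stories gives half-widths of about 0.09 and 0.07 under this approximation, comparable to the ranges quoted above.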

Because there are relatively few votes by fans, we have a
larger variance in estimates of $r_F$ than for $r_N$. In particular,
17 stories have no votes by fans, leading to a maximum likelihood
estimate $r_F = 0$, though with a large confidence interval.
The remaining values are approximately lognormally
distributed with maximum likelihood estimates of the mean
and standard deviation of $\log(r_F)$ equal to $-1.8 \pm 0.1$ and
$0.75 \pm 0.08$, respectively. The KS statistic indicates a weaker
fit, with a p-value of 0.04. Due to the relatively few votes, the
discrete nature of the observations likely significantly affects
the estimates. For example, a story with no fans among the
early votes may reflect a submitter with no fans and a low,
but nonzero, interestingness for fans. A subsequent vote by a
highly connected user would expose the story to many fans,
possibly leading to many votes that the model would miss by
assuming $r_F = 0$. One approach to this difficulty is using
the lognormal distributions of $r$ values as priors in the
estimation. This procedure somewhat improves performance, as
discussed below.

FIG. 16: Distribution of interestingness for (a) fans, and (b) non-fans.
The curves are lognormal fits to the values. Note the different
ranges for the horizontal scales in the two plots: $r_F$ values tend to be
significantly larger than $r_N$ values.
Overall, Fig. 18 shows there is little relation between how
interesting a story is to fans and other users: the correlation
between $r_F$ and $r_N$ is $-0.11$. A randomization test indicates
this small correlation is only marginally significant, with
p-value 0.05 of arising from uncorrelated values. The relationship
between interestingness for fans and other users indicates
a considerable variation in how widely stories appeal to the
general user community. Specifically, the ratio $r_F/r_N$ ranges
from 0 to 87, with median 9.3. The high values correspond to
stories that do not get a large number of votes, indicating they
are of significantly more interest to the fans of voters than to
the general user population, i.e., "niche interest" stories
(corresponding to the upper left points in Fig. 18). As described
below, this observation is useful to improve prediction of how
popular a story will become based on reaction of early voters.
Identifying niche interest stories could also aid user interface
design by selectively highlighting stories on the friends
interface that have particularly large estimated values of $r_F$.
Stories with high ratios of $r_F/r_N$ tend to be promoted after fewer
votes than those stories with low ratios.

FIG. 17: Quantile-quantile plot comparing the observed distribution
for $r_F$ (fans) and $r_N$ (non-fans) with the corresponding lognormal
distribution fits (thick curves). For comparison, the thin straight line
from 0 to 1 corresponds to a perfect match between the data and the
distribution.

FIG. 18: Log-log plot comparing estimated interestingness to fans
($r_F$) and non-fans ($r_N$) for 161 promoted stories with votes from fans
(so the estimate of $r_F$ is positive). All the stories in our data set had
non-fan votes, giving all the estimates for $r_N$ as positive numbers.
The line indicates where $r_F = r_N$.

FIG. 19: Relation between final number of votes and the fraction
of votes by fans among a story's first 10 votes. Small points are
individual stories and the large points are the mean values for each
number of votes by fans. The curve shows the model prediction.

An earlier study [24] noted a curious phenomenon: namely,
stories that initially spread quickly through the network, i.e.,
receive a large proportion of early votes from fans, end up
not becoming very popular; vice versa, stories that initially
spread slowly through the fan network end up becoming popular.
This phenomenon appears to be a generic feature of
information diffusion on social networks and has also been
observed on blog networks [9] and in Second Life (U.



Fig. 19 shows that our model explains this relationship,
which arises from the difference in interestingness for fans
and non-fans. Specifically, a low fraction of early votes by
fans indicates $r_N$ is relatively large, as needed to produce the
early non-fan votes in spite of the lower visibility of upcoming
stories to non-fan users. Once the story is promoted, it then
receives relatively more votes from the general user community
(most of whom are not fans of prior voters). The separation of effects
of visibility and interestingness with our model improves this
discrimination compared to just using the raw number of votes
by fans and non-fans without regard for the story visibility at
the time of the votes. For example, the correlation between the
final number of votes and $r_N/r_F$ is 0.72 compared to 0.64 for
the correlation between the final number of votes and $v_N/v_F$.



D. Discussion 

This model with niche interests captures the consequences 
of link choices: people tend to become fans of users who sub- 
mit or vote on stories of interest to that person. The ease 
of incorporating such additional detail is a useful feature of 
stochastic models. 

Comparing the two models illustrates the practical chal- 
lenges of incomplete or limited data. For example, data 
scraped from a web site can have errors due to unusual user 
names or unanticipated characters in story titles. Even when 
web sites provide an interface to collect data (as Digg pro- 
vided after the data used in this paper was extracted), subtle 
differences in interpretation of the data fields can still arise, as
when users who no longer have a Digg account are all given the
same name "inactive" and hence appear to be the same user if
not specifically checked for in the script collecting the data.

In particular, the "law of surfing" parameter estimates for 
the two models are significantly different, a consequence of 
the log-likelihood being a fairly flat function of these parame- 
ters. This arises due to the relatively weak constraints that vote 
history provides on views, i.e., how many pages of upcoming 
or front page stories users choose to view during a visit to 
Digg. For stronger constraints on this behavior, data would 
ideally include the pages users actually viewed. While such 
data is in principle available via the web site access logs, this 
information is not publicly available for Digg. Similarly, the 
promotion algorithm used by Digg is deliberately not made
public to reduce the potential for story submitters to game the 
system. To the extent that model parameter estimates differ 
from those that would be possible with this additional data, 
the stochastic approach identifies potential advantages mod- 
els can provide the web site provider, with access to more 
precise data on user behavior, e.g., for predicting popularity 
of newly submitted stories. More generally, the sensitivity of 
parameters to the available measured data can suggest addi- 
tional aspects of user behavior that would be most useful to 
determine, leading to more focus in future data collection and 
instrumentation of web communities. Alternatively, when the 
models indicate several different types of data could provide 
the required information, selecting the types most acceptable 
to the user community (e.g., privacy preserving) can facilitate 
the data collection while providing opportunities for more ac- 
curate models to guide the development of the web site and its 
usefulness to its community. 

A related data quality issue is the length of time over which 
data is collected. On the one hand, collecting data for long pe- 
riods can improve model parameter estimation by providing 
many more samples. On the other hand, web sites often re- 
arrange or add features to their user interfaces, which change 
how users find content. Digg also occasionally changes the 
promotion algorithm. That is, the stochastic behavior asso- 
ciated with the site is nonstationary rather than arising from 
a fixed distribution. Moreover, over longer periods of time 
new users join the site and some users become inactive. Thus 
one can't simply improve the model parameter estimation by 
collecting data over longer periods of time [16, 37]. Instead,
the models must be extended to include these additional time- 
dependent behaviors. 

In addition to improving quantitative estimation, similar 
qualitative behaviors seen with different models identify ar- 
eas for further investigation. For example, in the two models 
presented here, the distribution of interestingness over the sto- 
ries shows a lognormal distribution. This suggests there is an 
underlying multiplicative process giving rise to the observed 
values [31, 33]. Specifically, the lognormal distribution arises
from the multiplication of random variables in the same way 
that the central limit theorem leads to the normal distribution 
from the addition of random variables under weak restrictions 
on their variance and correlations. Thus an important question 
raised by these models is identifying the story characteristics 
and user behaviors that combine multiplicatively to lead to 
the observed lognormal distributions. Identifying such prop- 
erties would give a more detailed understanding of what leads 
to interesting content independent of the effects of visibility 
provided by the web site. 
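
The connection between multiplicative processes and the lognormal distribution can be illustrated with a small simulation. The sketch below is purely illustrative: the factor distribution (uniform on [0.5, 1.5]), the number of factors, and the function names are arbitrary assumptions, not quantities estimated from the Digg data.

```python
import math
import random

def simulate_products(n_samples, n_factors, seed=0):
    """Multiply n_factors independent positive random factors, n_samples times."""
    rng = random.Random(seed)
    products = []
    for _ in range(n_samples):
        value = 1.0
        for _ in range(n_factors):
            value *= rng.uniform(0.5, 1.5)  # arbitrary positive factor
        products.append(value)
    return products

def skewness(xs):
    """Sample skewness: third central moment over variance to the 3/2 power."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    third = sum((x - mean) ** 3 for x in xs) / n
    return third / var ** 1.5

products = simulate_products(5000, 50, seed=1)
log_products = [math.log(p) for p in products]
# The logarithms are sums of independent terms, so they are close to
# symmetric (approximately normal), while the raw products are heavily
# right-skewed, as expected for a lognormal distribution.
print(skewness(log_products), skewness(products))
```

Taking logarithms turns the product into a sum, so the central limit theorem applies to the logarithm of the product; this is the sense in which lognormally distributed interestingness suggests an underlying multiplicative combination of factors.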



VI. MODEL-BASED PREDICTION 

As discussed above, predicting popularity in social media
from intrinsic properties of newly submitted content is
difficult [34]. However, users' early reactions provide some
measure of predictability CSJ El El [23 [SJ. The early votes
on a story allow us to estimate its interestingness to fans and
other users. We can then use the model to predict how the
story will accumulate additional votes. These predictions are
for expected values and cannot account for the large variation
due, for example, to a subsequent vote by a highly connected
user which leads to a much larger number of votes.

                          model                 direct
                   distinct r    same r      extrapolation

first 216 votes       10%         12%            21%
first 10 votes        18%         23%            29%

TABLE IV: Prediction errors on whether a story receives at least 500
votes. The table compares three methods: 1) the full model which
allows distinct values for $r_F$ and $r_N$, 2) the model constrained to
have $r_F = r_N$, and 3) direct extrapolation from the rate the story
accumulates votes. This comparison involves 178 promoted stories,
of which 137 receive at least 500 votes.

As one prediction example, we evaluate whether a story 
will receive at least 500 votes. Predicting whether a story will 
attract a large number of votes, rather than the precise num- 
ber of votes, is a useful criterion for predicting whether the 
story will "go viral" and become very popular. This is exactly
Digg's intention behind using crowdsourcing to select a
subset of submitted content to feature on the front page [24].
The 500 vote threshold is a useful rule of thumb, as that is
close to the median popularity value in a large sample of Digg
stories [26, 38].



Table IV compares the predictions with different methods,
including a constrained version of our model with $r_F = r_N$,
which assumes no systematic difference in interest between
fans and other users.

We also compare with direct extrapolation from the early
votes. In this procedure, with $v$ votes observed at time $t$, we
extrapolate to $v\,t_{\mathrm{final}}/t$, where we take $t_{\mathrm{final}}$ to be 72 hours,
a time by which stories have accumulated all, or nearly all,
the votes they will ever get. We use a least squares linear fit
between these observed and extrapolated values. A pairwise
bootstrap test indicates the model has a lower prediction error
than this extrapolation with p-value of $10^{-2}$.
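
The direct extrapolation baseline can be sketched in a few lines. This is one reading of the procedure described above, not a definitive implementation: in particular, treating the least squares fit as a regression of observed final votes on the extrapolated values is an assumption, and all function names are hypothetical.

```python
def extrapolate_votes(v, t, t_final=72.0):
    """Linear extrapolation of v votes seen at time t to time t_final (hours)."""
    return v * t_final / t

def least_squares_line(xs, ys):
    """Ordinary least squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_popular(v, t, slope, intercept, threshold=500, t_final=72.0):
    """Predict whether the story reaches the vote threshold."""
    predicted = slope * extrapolate_votes(v, t, t_final) + intercept
    return predicted >= threshold
```

For example, a story with 10 votes after 2 hours extrapolates to 360 votes at 72 hours; the fitted line, calibrated on a set of training stories, then maps this raw extrapolation to a predicted final count that is compared with the 500 vote threshold.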

This extrapolation method is similar to that used to predict 
final votes from the early votes [35], but with two differences:
1) we extrapolate from the time required for the story to ac- 
quire a given number of votes instead of the number of votes 
at a given time, and 2) we use early votes after submission 
(i.e., including when the story is upcoming, where the social 
network has a large effect) instead of early votes after promo- 
tion. 

In the case of prediction based on the first 10 votes, which is 
before the stories are promoted, an additional question is how 
well the model predicts whether the story will eventually be 
promoted. We find a 25% error rate in predicting promotion 
based on the first 10 votes. 

We can improve predictions from early votes by using
the lognormal distributions of $r_F$ and $r_N$, shown in
Fig. 16, as the prior probability to combine with the
likelihood from the observations according to Bayes' theorem.
Specifically, instead of maximizing the likelihood of the
observed votes, $P(\mathrm{votes} \mid r)$, as discussed above, this approach
maximizes the posterior probability, which is proportional to
$P(\mathrm{votes} \mid r)\,P_{\mathrm{prior}}(r)$, where $P_{\mathrm{prior}}$ is taken to be the
lognormal distribution $P_{\mathrm{lognormal}}$ with parameters from the fits
shown in Fig. 16.

FIG. 20: Comparison of log-likelihood (i.e., $\log P(\mathrm{votes} \mid r)$) and
log-likelihood plus $\log(P_{\mathrm{prior}}(r))$ for estimating $r_F$ for a story with no
fan votes. The maximum of the log-likelihood is at $r_F = 0$ while the
maximum with the prior is $r_F = 0.086$.



This method gives little change in estimates of $r_N$, due to
the relatively large number of non-fan votes on each story.
However, using the prior makes large changes in some of the
$r_F$ estimates, thereby avoiding the small number of extreme
predictions made by poor estimates. Using this prior to aid
estimation is particularly significant when there are no votes
by fans among the early votes, leading to an estimate of $r_F = 0$,
but later a user with many fans votes on the story. In this
case, as illustrated in Fig. 20, using the lognormal as a prior
gives a positive estimate for $r_F$, thereby predicting some votes
by any subsequent users who are fans of earlier voters.
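
The effect of the prior on a story with no fan votes can be illustrated with a simplified stand-in for the likelihood. In the sketch below, the binomial likelihood (each of `views` fan exposures produces a vote with probability $r$) is an assumption made only for illustration and is not the likelihood derived from the rate equations; the function names and the grid search are likewise hypothetical. The prior parameters are the fitted values for $\log(r_F)$ quoted earlier.

```python
import math

def log_lognormal_pdf(r, mu, sigma):
    """Log density of a lognormal distribution at r > 0."""
    z = (math.log(r) - mu) / sigma
    return -math.log(r) - math.log(sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z

def log_binomial_likelihood(r, votes, views):
    """Stand-in likelihood (assumption): votes out of views, each with chance r."""
    return votes * math.log(r) + (views - votes) * math.log(1.0 - r)

def mle_estimate(votes, views):
    """Maximum likelihood estimate under the stand-in likelihood."""
    return votes / views

def map_estimate(votes, views, mu, sigma, n_grid=20000):
    """Maximize log-likelihood plus log-prior over a logarithmic grid in r."""
    lo, hi = math.log(1e-6), math.log(0.999)
    best_r, best_value = None, -math.inf
    for i in range(n_grid + 1):
        r = math.exp(lo + (hi - lo) * i / n_grid)
        value = log_binomial_likelihood(r, votes, views) + log_lognormal_pdf(r, mu, sigma)
        if value > best_value:
            best_r, best_value = r, value
    return best_r
```

With no fan votes, `mle_estimate(0, 20)` is exactly zero, whereas `map_estimate(0, 20, -1.8, 0.75)` is small but positive, which is the qualitative behavior shown in Fig. 20. With many votes, the prior has little influence and the two estimates nearly coincide.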

By avoiding these extreme cases, this procedure improves 
the correlation between predicted and actual final votes as well 
as the predicted rank ordering of the stories (i.e., whether the 
story is likely to be relatively popular) as seen with a larger 
value of the Spearman rank correlation when using the prior 
distribution. For example, when predicting based on the first 
10 votes, using this prior increases the Spearman rank correla- 
tion between predicted and actual number of votes from 0.46 
to 0.53. For comparison, this correlation for direct extrapola- 
tion from the first 10 votes is 0.32 and is 0.34 for the model 
constrained to have $r_F = r_N$. Pairwise bootstrap tests indicate
the differences between these values are significant with
p-values less than $10^{-3}$, except the difference between the last
two cases, which has a p-value of $10^{-2}$.
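
The Spearman rank correlation used in this comparison is the Pearson correlation of the ranks of the two sequences. A minimal sketch follows, with ties given their average rank (a common convention, assumed here); the function names are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation between two equal-length sequences."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sxx = sum((a - mean_x) ** 2 for a in rx)
    syy = sum((b - mean_y) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)
```

Because it depends only on rank order, this measure rewards a predictor for ordering stories correctly by popularity even when the predicted vote counts themselves are off by a large factor.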



VII. RELATED WORK 

The Social Web provides massive quantities of available
data about the behavior of large groups of people. Researchers
are using this data to study a variety of topics, including
detecting (U [29) and influencing Qj] [20) trends in public
opinion, and dynamics of information flow in groups [28, 39].



Several researchers examined the role of social dynamics in 
explaining and predicting distribution of popularity of online 
content. Wilkinson [37] found broad distributions of popular- 
ity and user activity on many social media sites and showed 
that these distributions can arise from simple macroscopic 
dynamical rules. Wu & Huberman [38] constructed a phe- 
nomenological model of the dynamics of collective attention 
on Digg. Their model is parameterized by a single variable 
that characterizes the rate of decay of interest in a news arti- 
cle. Rather than characterize evolution of votes received by 
a single story, they show the model describes the distribution 
of final votes received by promoted stories. Our model offers 
an alternative explanation for the distribution of votes. Rather
than novelty decay, we argue that the distribution can also be
explained by the combination of non-uniform variations in
the stories' inherent interest to users and effects of the user
interface, specifically decay in visibility as the story moves to
subsequent front pages. Such a mechanism can also explain the
distribution of popularity of photos on Flickr, which would
be difficult to characterize by novelty decay. Crane &
Sornette [10] analyzed a large number of videos posted on
YouTube and found that collective dynamics was linked to the
inherent quality of videos. By looking at how the observed
number of votes received by videos changed in time, they could
separate high quality videos, whether they were selected by
YouTube editors or spontaneously became popular, from junk
videos. This study is similar in spirit to our own in exploiting
the link between observed popularity and content quality.
However, while this study and that of Wu & Huberman
aggregated data from tens of thousands of individuals, our method
focuses instead on the microscopic dynamics, modeling how
individual behavior contributes to the observed popularity of
content. In [27] we used the simple model of social dynamics,
reviewed in this paper, to predict whether Digg stories will
become popular. The current paper improves on that work.

Researchers found statistically significant correlation
between early and late popularity of content on Slashdot f\M ,
Digg and YouTube [35]. Specifically, similar to our study,
Szabo & Huberman [35] predicted long-term popularity of
stories on Digg. Through large-scale statistical study of stories
promoted to the front page, they were able to predict stories' 
popularity after 30 days based on their popularity one hour af- 
ter promotion. Unlike our work, their study did not specify a 
mechanism for evolution of popularity, and simply exploited 
the correlation between early and late story popularity to make 
the prediction. Our work also differs in that we predict pop- 
ularity of stories shortly after submission, long before they 
are promoted. Several researchers (H [9J [24) found that early
diffusion of information across an interlinked community is a
useful predictor of how far it will spread across the network
in general. Both [24] and [9] exploited the anti-correlation
between these phenomena to predict final popularity. Specifically,
the former work used anti-correlation between the number
of early fan votes and stories' eventual popularity on Digg
to predict whether stories submitted by well connected users
will become popular. That work exploited social influence
only to make the prediction, and the results were not applicable
to stories submitted by poorly connected users which were
not quickly discovered by highly connected users. In contrast,
the approach described in this paper considers effects of
social influence regardless of the connectedness of the submitter,
and also accounts for story quality in making a prediction
about story popularity. An interesting open question is the
nature of the social influence on voting. In our model, the
influence has two components: increased visibility of a story
to fans due to the friends interface and the higher interestingness
of the story to fans. This higher interestingness could
be due to self-selection, whereby users become fans of people
whose submissions or votes are of particular interest.
Alternatively, users could be directly influenced by the activities
of others [34], with the possibility that this influence depends
not just on whether friends vote on a story but also how many
friends do so [7].

VIII. CONCLUSION 

In the vast stream of new user-generated content, only a few 
items will prove to be popular, attracting a lion's share of at- 
tention, while the rest languish in obscurity. Predicting which 
items will become popular is exceedingly difficult, even for 
people with significant expertise. This prediction difficulty 
arises because popularity is weakly related to inherent content
quality and social influence leads to an uneven distribution
of popularity that is sensitive to the early choices of users
in the social network. We described how stochastic models
of user behavior on a social media web site can partially ad- 
dress this prediction challenge by quantitatively characteriz- 
ing evolution of popularity. The model shows how popularity 
is affected by item quality and social influence. We evaluated 
the usefulness of this approach for the social news aggregator 
Digg, which allows users to submit and vote on news stories. 
The number of votes a story accumulates on Digg shows its 
popularity. In earlier work we developed a model of social 
voting on Digg, which describes how the number of votes re- 
ceived by a story changes in time. In that model, knowing how 
interesting a story is to the user community, on average, and 
how connected the submitter is fully determines the evolution 
of the story's votes. This leads to the insight that a model can
be used to predict a story's popularity from the initial reaction
of users to it. Specifically, we use observations of evolution of
the number of votes received by a story shortly after submis- 
sion to estimate how interesting it is, and then use the model 
to predict how many votes the story will get after a period of a 
few days. Model-based prediction outperforms other methods 
that exploit social influence only, or correlation between early 
and late votes received by stories. We improved prediction 
by developing a more fine-grained model that differentiates 
between how interesting a story is to fans and to the general 
population. 